-
Notifications
You must be signed in to change notification settings - Fork 202
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(store-sync,store-indexer): schemaless indexer #1965
Conversation
🦋 Changeset detectedLatest commit: b02aedc The changes in this PR will be included in the next version bump. This PR includes changesets to release 30 packages
Not sure what this means? Click here to learn what changesets are. Click here if you're a maintainer who wants to add another changeset to this PR |
12e0781
to
d996cd8
Compare
c15b263
to
837ed6b
Compare
.from(tables.chainTable) | ||
.where(eq(tables.chainTable.chainId, chainId)) | ||
.execute() | ||
.then((rows) => rows.find(() => true)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for context, what is the purpose of the (rows) => rows.find(() => true)
line?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll leave some comments around but this basically sidesteps our lack of noUncheckedIndexedAccess: true
.
const rows = query(...);
const row = rows[0]; // type will just be `Row`, which is inaccurate for an empty array
// vs
const row = rows.find(() => true); // type will be `Row | undefined`
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ohh so it's essentially "give me the first item if it exists"; i had head it as "filter"
packages/store-indexer/package.json
Outdated
@@ -27,7 +27,7 @@ | |||
"lint": "eslint .", | |||
"start:postgres": "concurrently -n indexer,frontend -c cyan,magenta 'tsx bin/postgres-indexer' 'tsx bin/postgres-frontend'", | |||
"start:postgres:local": "DEBUG=mud:store-sync:createStoreSync DATABASE_URL=postgres://127.0.0.1/postgres RPC_HTTP_URL=http://127.0.0.1:8545 pnpm start:postgres", | |||
"start:postgres:testnet": "DEBUG=mud:store-sync:createStoreSync DATABASE_URL=postgres://127.0.0.1/postgres RPC_HTTP_URL=https://follower.testnet-chain.linfra.xyz pnpm start:postgres", | |||
"start:postgres:testnet": "DEBUG=mud:* DATABASE_URL=postgres://127.0.0.1/postgres RPC_HTTP_URL=https://follower.testnet-chain.linfra.xyz pnpm start:postgres", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this change intended to be merged or for dev debug purposes?
const { lastUpdatedBlockNumber } = metadata[0] ?? {}; | ||
const tablesWithRecords: TableWithRecords[] = tables.map((table) => { | ||
const records = logs | ||
.filter((log) => getAddress(log.address) === getAddress(table.address) && log.args.tableId === table.tableId) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should the filter on address
and tableId
be passed to getLogs
to let postgres handle it instead of filtering the result array?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This does less querying than we were doing before but for the same amount of data.
Before when records were distributed across many tables, we had to query for a list of tables then query each table individually for their records.
Now we query for all logs (passing the filters to the SQL query), then we just group them by table for the purposes of findAll
.
This is mostly a backwards compat thing for findAll
with the new table structure. The clients will start using the new getLogs
endpoint in a PR coming soon.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ahh gotcha, this part is just grouping the logs by table, not really filtering the data. Even though this is mostly used for backwards compatibility, I wonder should we slightly change the logic here to iterate through the logs once and group them by table (ie have an object with address/tableId
as key and array as value, iterate over the logs to push to the right array, and then do Object.values
), instead of iterating through all logs number of table
times?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
will adjust here but I'd prefer to save these kinds of things for an optimization pass where we can set up some measurement tools etc. (assuming we even want to keep supporting this endpoint)
although it will prob be more performant, I find your suggested approach a bit harder to read and maintain, so it'd be good to know what perf we're trading off in terms of readability vs speed/memory
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
agree with no premature optimization and instead prioritizing benchmarks to find the actual bottlenecks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah so the other reason I did this like this is because we need the table schema to decode the log, and didn't want to create another lookup mapping
going to keep this as-is for readability for now, and can follow up with measurement
.from(tables.chainTable) | ||
.where(eq(tables.chainTable.chainId, chainId)) | ||
.execute() | ||
.then((rows) => rows.find(() => true)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same question here as before, why do we need the .then((rows) => rows.find(() => true))
line?
blockNumber = bigIntMax( | ||
blockNumber, | ||
records.reduce((max, record) => bigIntMax(max, record.lastUpdatedBlockNumber ?? 0n), 0n) | ||
); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit, if use blockNumber
as the initial value for the accumulator we wouldn't need the outer bigIntMax
. No strong opinion, could argue the intent is clearer with the explicit outer bigIntMax
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the lastUpdatedBlockNumber
of a record is higher than the chainState.lastUpdatedBlockNumber
, would that mean there is an issue with the way we update chainState.lastUpdatedBlockNumber
or is there a valid situation where it could happen? If it means a bug we should probably log something and aim to remove this step at some point.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I only did this here because of the order of queries that are run separately, one for the chain's current block number and one for all the records. I could probably do this at query time (i.e. a single query), but I was being lazy about crafting that query since Drizzle is a bit clunky/annoying with joins.
The issue I am solving for is that the indexer may be up to date for a chain but a world may not have had any recent activity. If we query for just records and return the highest block number of those records, that would signal to the client that it needs to start fetching from the RPC from a much older block number than is necessary.
So basically: we get the chain's block number first, as the "minimum" block number value. Then we get all the records and their highest block number. If it's higher than the chain, then we know some data updated between our two queries and we should use the higher number. If it's lower than the chain, we use the chain block number, because that's where the indexer is at in terms of up-to-date data.
eventName: "Store_SetRecord", | ||
args: { | ||
tableId: record.tableId, | ||
keyTuple: decodeDynamicField("bytes32[]", record.keyBytes), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
at some point we should do benchmarks of this hot path, I hope the decoding here doesn't become a bottleneck if a large number of records is requested
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I doubt it, decoding individual fields is fairly lightweight. This one just chunks the concatenated hex into individual bytes32 hex items. There's room to optimize that method but I doubt this is where our bottleneck is.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
reason why i'm saying this is because we had performance issues with string manipulations in the past (back in network stack v1 times), but I agree we should definitely benchmark first to find the bottlenecks
import { setupTables } from "./setupTables"; | ||
import { StorageAdapter, StorageAdapterBlock } from "../common"; | ||
|
||
// Currently assumes one DB per chain ID |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this comment still accurate?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was confused about the chain
table but I now understand the expectation is to have only one row in that table, and we don't have a chainId
column in the data table so couldn't support multiple chains in the same data table with the current schema
); | ||
|
||
const logs = records | ||
.filter((record) => !record.isDeleted) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we just include !isDeleted
in the sql query?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I intentionally didn't for a few reasons:
- on the whole, I expect the number of actual deleted records to be low, because it costs gas to delete vs leaving a record laying around
- doing it at the query level will change the query planning and I want to make sure we hit the indexes I've set up here (not to say that we can't but I wanna make sure we can make use of the indexes)
- we'll be doing similar shaped queries for the update stream (query all records and return whether a record is deleted) and seems wise to use the same SQL query path for keeping cache warm
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
on the whole, I expect the number of actual deleted records to be low, because it costs gas to delete vs leaving a record laying around
Independent of the number of deleted records it would save us one linear operation over the entire logs array if we don't have to filter for isDeleted
after fetching all the logs. If we'd add another index for isDeleted
we could avoid that linear operation, at the cost of more expensive inserts (but since inserts are throttled by the blockchain, write performance is less critical than read performance)
we'll be doing similar shaped queries for the update stream (query all records and filter or return whether a record is deleted) and seems wise to use the same SQL query path for caching
For the update stream the query would also include the latest block number no? So wouldn't it already be a slightly different query shape?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the update stream the query would also include the latest block number no? So wouldn't it already be a slightly different query shape?
ah yeah, that's true
mind if we come back to this? it would be nice to set up some scaffolding for testing/measuring query performance, processing performance (SQL query -> HTTP response), etc. and I don't want to blow up this PR
some of this might be easier if we move to static SQL queries (rather than ORM) that we can run EXPLAIN on and include as part of snapshot testing etc. to make sure indices are being hit in the way we expect
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah definitely, this is not blocking, approved the PR #1965 (review)
lastUpdatedBlockNumber: asBigInt("last_updated_block_number", "numeric"), | ||
}, | ||
(table) => ({ | ||
pk: primaryKey(table.address, table.tableId, table.keyBytes), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if the primary key is the combination of these three columns, do we need individual indices on address
and tableId
for efficient querying on those (where we don't include the keyBytes
in the query)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the indexes below this handle that and I believe that an index on [address, tableId, key0]
can still benefit a query using address=... AND tableId=...
pk: primaryKey(table.address, table.tableId, table.keyBytes), | ||
key0Index: index("key0_index").on(table.address, table.tableId, table.key0), | ||
key1Index: index("key1_index").on(table.address, table.tableId, table.key1), | ||
// TODO: add indices for querying without table ID | ||
// TODO: add indices for querying multiple keys |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we also add an index for isDeleted
to be able to efficiently use it in getLogs
instead of having to filter the array in js?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
see comment above
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
my comments are mostly about performance, which is hard to fully reason about without benchmarks and its not blocking for functionality, so I'm happy merging as is. But I think we should prioritise adding a benchmark for getLogs
on a table with lots of rows so we can more efficiently identify bottlenecks in hot paths.
closes #1640
The client libs don't use the indexer's new
getLogs
method yet, planning to do that in a follow up PR.